Fix OpenMP thread allocation in Linux #5551
Conversation
@guolinke @shiyu1994 Just checking in for updates. We are awaiting this fix to unblock a release. |
Thanks for the fix. Just left two quick questions. Could you please help address them? I'll get this merged as soon as possible.
@svotaw will force set |
Not sure what you are suggesting. Are you asking whether that would fix it in my particular case, or for everyone? I'm not sure we should force a fixed number of threads (and one as low as 8), and according to the docs you can override this with the local annotation in some cases. I'm not sure we should rely on an environment variable being set to fix a bug. Or are you suggesting using the OMP_NUM_THREADS env var first instead of OMP_THREAD_LIMIT?

However, in investigating your question, I see an env variable OMP_DYNAMIC, which defaults to true. I will test what happens when that is false. That might be our issue, in the sense that multiple threads may be allowed to adjust their own max OpenMP threads. If so, maybe that will at least solve it in our particular case for now, but I'd still suggest we find a permanent solution that does not rely on env var settings. Will reply with results. |
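For reference, the OMP_DYNAMIC experiment described above can be sketched with a tiny standalone program. This is only a sketch of the test, not code from the PR; it uses standard OpenMP APIs and assumes a compiler with OpenMP enabled (e.g. g++ -fopenmp):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
  // Report the dynamic-adjustment setting and the max thread count as-is
  // (i.e., with whatever OMP_DYNAMIC / OMP_NUM_THREADS are in the environment).
  std::printf("dynamic=%d, max_threads=%d\n",
              omp_get_dynamic(), omp_get_max_threads());

  // Disable dynamic adjustment for this process (the API equivalent of
  // OMP_DYNAMIC=false) and see whether the reported max thread count changes.
  omp_set_dynamic(0);
  std::printf("dynamic=%d, max_threads=%d\n",
              omp_get_dynamic(), omp_get_max_threads());
  return 0;
}
```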
okay, I see. Another solution I suggest is to dynamically increase the buffer, when |
Correct, I didn't want to adversely affect a shared critical pathway with an extra comparison, so I would rather avoid a dynamic buffer size. Also, without a fixed size, you can't generate deterministic thread ids across threads, so you end up with thread collisions anyway. I'm taking a better look at the OpenMP APIs now that I understand the issue better and you've pointed out some good leads. Should have something up today. |
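To make the "fixed size gives deterministic thread ids" point concrete, here is a minimal sketch of the kind of layout being discussed. The names (PushBuffers, PushIndex, x, y) are illustrative only, not LightGBM's actual types:

```cpp
#include <vector>
#include <omp.h>

// Illustrative layout: Y external (calling) threads, X OpenMP threads per
// external thread, and one push buffer per (external thread, OpenMP thread) pair.
struct PushBuffers {
  int x;                                     // OpenMP threads per external thread
  std::vector<std::vector<double>> buffers;  // X * Y buffers, allocated once

  PushBuffers(int x_threads, int y_external_threads)
      : x(x_threads),
        buffers(static_cast<size_t>(x_threads) * y_external_threads) {}

  // Deterministic, non-overlapping index. This only stays collision-free if
  // every caller agrees on the same fixed value of x.
  int PushIndex(int external_thread_id) const {
    return external_thread_id * x + omp_get_thread_num();
  }
};
```

If x is instead re-derived per caller and differs between callers, two callers can compute overlapping or out-of-range indices, which is exactly the collision described above.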
After a more thorough investigation, here's what I found with OpenMP. As a reminder, what we need is a constant that represents how many OpenMP threads each external thread will create. With this constant, we can pre-allocate sets of buffers (i.e., 1 set of X buffers for each of Y calling threads, where Y is given and we need to find X). The ideal candidate would be one of OpenMP's own thread-count settings, so I tried all the APIs that OpenMP has to tweak thread counts: OMP_DYNAMIC, OMP_THREAD_LIMIT, OMP_NUM_THREADS, etc.
Here are my thoughts for a flexible solution that is better than the static constant from the first iteration of this PR:
See the latest iteration for an example of this suggestion. We can still debate details of caching, naming, the location of the env var utils, etc., but based on the more extensive testing I did, this works. It works out of the box, and is still flexible for corner cases. |
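A minimal sketch of the env-var-with-fallback idea (the helper and variable name below are hypothetical illustrations, not the ones added in this PR):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: read an integer from an environment variable,
// falling back to a default when it is unset or unparsable.
static int IntFromEnvOrDefault(const char* name, int default_value) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) return default_value;
  try {
    return std::stoi(std::string(raw));
  } catch (...) {
    return default_value;
  }
}

// Hypothetical usage: let a corner-case deployment override the assumed
// OpenMP-threads-per-caller count, while everyone else gets a sane default.
// int threads_per_caller = IntFromEnvOrDefault("LGBM_OMP_THREADS_PER_CALLER", 16);
```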
@guolinke Actually, having looked at it and thought about it for another day, it might be better to choose another way to set this value. Since this isn't really an OpenMP setting, perhaps it doesn't make sense to set it in openmp_wrapper. Also, I don't really like the static var. I can think of 2 other options:
I will implement #1 in another iteration when I get a chance. |
I think this is the best fix, although of course you are free to comment. I believe it's ready for checkin, other than your comment on whether we should use 8 or 16 as a default (see TODO in dataset.h). Note the failing R test is most likely a flake. |
It seems that some cpp tests are failing. Will look into this tomorrow. |
Thank you, LGTM!
Let us fix the CI next |
Thanks for the signoff and feedback! Just to be clear, is this a general CI problem, or something related to this PR? (I assume the former, but making sure) |
Please update to the latest master. To be clear, we never merge PRs in this repo where CI is failing, even if the failures seem unrelated to the PR's changes. |
Sure, I guess I meant: is this something I need to merge a fix for, or something where the backend pipeline just needs to be re-run? I merged from master. |
After syncing with main, I'm seeing the error below in a few R tests. I will try a re-run to confirm it's real. The downloaded binary packages are in |
I fixed all the tests, except for the R tests, which are failing because MiKTeX is somehow failing. If that's related to this PR, can someone point me to a possible cause? I don't know anything about MiKTeX. |
That looks unrelated to this PR's changes, as we're seeing it elsewhere on PRs and |
@jameslamb Can this be checked in? |
I'll merge this, based on @guolinke's approval. Normally I'd label a change like this, which adds a new required argument to a public API in ... Thanks for the help! |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
For the streaming push APIs that are designed to work multi-threaded (e.g. LGBM_DatasetPushRowsByCSRWithMetadata), we observed consistent failures in Linux. These APIs rely on pre-allocating batches of sparse push arrays to avoid thread collisions. To do the initial allocation during LGBM_InitStreaming, we were using OMP_NUM_THREADS(). This works just fine on Windows, where tested.

However, in Linux, it appears that each calling thread uses its own space to determine OpenMP thread statistics: neither OMP_NUM_THREADS() nor omp_get_max_threads() returns the same value for different calling threads.
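A repro-style sketch of that observation (not code from the PR): query the OpenMP max thread count from two separate native threads. On the affected Linux setups the printed values were seen to differ between callers, while on Windows they matched. Assumes a compiler with OpenMP enabled:

```cpp
#include <omp.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<std::thread> callers;
  for (int i = 0; i < 2; ++i) {
    callers.emplace_back([i]() {
      // Each native "calling" thread asks OpenMP how many threads it would get.
      std::printf("caller %d: omp_get_max_threads() = %d\n",
                  i, omp_get_max_threads());
    });
  }
  for (auto& t : callers) t.join();
  return 0;
}
```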
To fix this, this PR switches to a static MAX_THREAD_ALLOCATION constant and simply pre-allocates based on a known, fixed thread space. This wastes a small amount of memory due to over-allocation (empty vectors), but testing confirms that running multi-threaded now works in Linux as well.
Note that there are multi-threaded unit tests to cover this, but because OpenMP is not supported in the C++ unit test framework, the test threading is implemented manually and hence always succeeds. This issue is only seen when OpenMP is active and we are running in Linux.
Example:

2 external Spark Tasks are calling LGBM_DatasetPushRowsByCSRWithMetadata to load a Dataset in parallel. Before pushing, Spark Task thread 0 calls LGBM_InitStreaming with an expected 2 external calling threads (1 for each Spark Task). OMP_NUM_THREADS() returns 3 threads, so the sparse bins allocate 6 independent push buffers (2 external threads * 3 OpenMP threads per external thread) to avoid thread collisions. The expected internal_thread_id range of Spark Task thread 0 is 0-2 and the range of Spark Task thread 1 is 3-5, and these ranges are used to index the correct thread-safe sparse push buffer.

However (in Linux specifically), OMP_NUM_THREADS() returns 4 for Spark Task thread 1, a different value than what Spark Task thread 0 saw during the initial allocation (3). This results in Spark Task thread 1 calling LGBM_DatasetPushRowsByCSRWithMetadata and thinking that its internal_thread_id range is 4-7 (as opposed to 3-5). This causes a JVM crash, since trying to index sparse push buffer 6 or 7 is out of range.

Also observed: even if you add the ability to dynamically add push buffers (to avoid the out-of-range access), you still get indexing collisions. 2 external threads can end up using the same internal_thread_id due to the way we calculate it. OMP_NUM_THREADS() is expected to be a constant so that it generates non-overlapping OpenMP tid ranges, but in Linux it seems to vary depending on which thread called it.
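The arithmetic behind the crash, using the numbers from the example above (the index formula here is a simplified illustration of the internal_thread_id calculation, not the exact code):

```cpp
#include <cstdio>

int main() {
  // At LGBM_InitStreaming time, caller 0 sees 3 OpenMP threads per caller,
  // so 2 callers * 3 = 6 push buffers are allocated (valid indices 0..5).
  const int omp_threads_at_init = 3;
  const int num_buffers = 2 * omp_threads_at_init;

  // Later, on Linux, caller 1 observes 4 OpenMP threads instead of 3 and
  // computes its indices as 1 * 4 + {0..3} = 4..7, overrunning the 6 buffers.
  const int omp_threads_seen_by_caller_1 = 4;
  for (int omp_tid = 0; omp_tid < omp_threads_seen_by_caller_1; ++omp_tid) {
    const int internal_thread_id = 1 * omp_threads_seen_by_caller_1 + omp_tid;
    std::printf("internal_thread_id=%d%s\n", internal_thread_id,
                internal_thread_id >= num_buffers ? " (out of range!)" : "");
  }
  return 0;
}
```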